Monocular 3D human pose estimation is quite challenging due to the inherent ambiguity and occlusion, which often lead to high uncertainty and indeterminacy. On the other hand, diffusion models have recently emerged as an effective tool for generating high-quality images from noise. Inspired by their capability, we explore a novel pose estimation framework (DiffPose) that formulates 3D pose estimation as a reverse diffusion process. We incorporate novel designs into our DiffPose that facilitate the diffusion process for 3D pose estimation: a pose-specific initialization of pose uncertainty distributions, a Gaussian Mixture Model-based forward diffusion process, and a context-conditioned reverse diffusion process. Our proposed DiffPose significantly outperforms existing methods on the widely used pose estimation benchmarks Human3.6M and MPI-INF-3DHP.
translated by 谷歌翻译
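As a rough illustration of the forward diffusion idea named in the DiffPose abstract, the sketch below adds noise to a 3D pose with a standard single-Gaussian, DDPM-style step. This is not the paper's Gaussian Mixture Model-based process. The function name `forward_diffuse`, the three-joint pose, and the `alpha_bar` values are assumptions made purely for illustration.

```python
import math
import random


def forward_diffuse(pose, alpha_bar, rng):
    """Sample x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps, eps ~ N(0, I).

    `pose` is a list of (x, y, z) joints; `alpha_bar` is the cumulative
    signal-retention factor of a generic Gaussian diffusion schedule.
    """
    if not 0.0 <= alpha_bar <= 1.0:
        raise ValueError("alpha_bar must lie in [0, 1]")
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    return [
        tuple(signal * c + noise * rng.gauss(0.0, 1.0) for c in joint)
        for joint in pose
    ]


# Hypothetical three-joint pose (illustrative coordinates only).
pose = [(0.0, 0.0, 0.0), (0.1, 0.5, 0.0), (0.2, 1.0, 0.1)]
rng = random.Random(0)

clean = forward_diffuse(pose, 1.0, rng)  # alpha_bar = 1 adds no noise
noisy = forward_diffuse(pose, 0.5, rng)  # an intermediate, partly noised pose
```

With `alpha_bar = 1.0` the pose is returned unchanged, and as `alpha_bar` approaches 0 the sample approaches pure Gaussian noise. A reverse process would learn to undo such steps, conditioned on image context.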
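The goal of fine-grained recognition is to successfully distinguish between action categories with subtle differences. To tackle this problem, we draw inspiration from the human visual system, which contains specialized regions in the brain dedicated to handling specific tasks. We design a novel Dynamic Spatio-Temporal Specialization (DSTS) module, which consists of specialized neurons that are activated only for a subset of highly similar samples. During training, the loss forces the specialized neurons to learn discriminative fine-grained differences that distinguish between these similar samples, improving fine-grained recognition. Moreover, a spatio-temporal specialization method further optimizes the architectures of the specialized neurons to capture more spatial or more temporal fine-grained information, to better handle the wide range of spatio-temporal variations in videos. Finally, we design an Upstream-Downstream Learning algorithm to optimize the model's dynamic decisions during training, improving the performance of the DSTS module. We obtain state-of-the-art performance on two widely used fine-grained recognition datasets.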
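Human interaction recognition is very important in many applications. One crucial cue for recognizing an interaction is the interacting body parts. In this work, we propose a novel Interaction Graph Transformer (Igformer) network for skeleton-based interaction recognition that models the interacting body parts as graphs. More specifically, the proposed Igformer constructs interaction graphs according to the semantic and distance correlations between the interacting body parts, and enhances the representation of each person by aggregating the information of the interacting body parts based on the learned graphs. Furthermore, we propose a Semantic Partition Module that transforms each human skeleton sequence into a body-part sequence, to better capture the spatial and temporal information of the skeleton sequence for learning the graphs. Extensive experiments on three benchmark datasets show that our model outperforms the state of the art by a margin.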
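Early action prediction aims to successfully predict the class label of an action before it has been completely performed. This is a challenging task because the beginning stages of different actions can be very similar, with only subtle discriminative differences. In this paper, we propose a novel Expert Retrieval and Assembly (ERA) module that retrieves and assembles a set of experts most specialized at using discriminative subtle differences to distinguish an input sample from other highly similar samples. To encourage our model to use subtle differences effectively for early action prediction, we push the experts to discriminate only between highly similar samples, forcing them to learn to use the subtle differences that exist between those samples. In addition, we design an effective Expert Learning Rate Optimization method that balances the optimization of the experts and leads to better performance. We evaluate our ERA module on four public action datasets and achieve state-of-the-art performance.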
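Recommender systems (RSs) aim to model and predict user preferences while users interact with items such as points of interest (POIs). These systems face several challenges, such as data sparsity, that limit their effectiveness. In this paper, we address this problem by incorporating social, geographical, and temporal information into the matrix factorization (MF) technique. To this end, we model social influence based on two factors: the similarity between users in terms of common check-ins, and the friendships between them. We introduce two kinds of friendship, based on explicit friendship networks and on high check-in overlap between users. We base our friendship algorithm on users' geographical activity centers. The results show that our proposed model outperforms the state of the art on two real-world datasets. More specifically, our ablation study shows that the social model improves the performance of our proposed POI recommendation system in terms of Precision@10 on the Gowalla and Yelp datasets.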
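To make the two friendship signals in this abstract concrete, the sketch below scores check-in overlap between two users and combines it with an explicit friendship list. The abstract does not specify the similarity measure, so the Jaccard index, the `threshold` value, and the names `checkin_overlap` and `are_friends` are assumptions for illustration, not the authors' algorithm.

```python
def checkin_overlap(a, b):
    """Jaccard similarity between two users' sets of visited POI ids (assumed measure)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def are_friends(user_u, user_v, explicit_friends, checkins, threshold=0.5):
    """Treat two users as friends if they are explicitly linked
    or if their check-in overlap reaches the (hypothetical) threshold."""
    explicit = (user_u, user_v) in explicit_friends or (user_v, user_u) in explicit_friends
    if explicit:
        return True
    return checkin_overlap(checkins[user_u], checkins[user_v]) >= threshold


# Toy data: POI ids and one explicit friendship (all values illustrative).
checkins = {
    "u1": {"cafe", "gym", "park"},
    "u2": {"cafe", "gym", "museum"},
    "u3": {"airport"},
}
explicit_friends = {("u1", "u3")}

overlap_u1_u2 = checkin_overlap(checkins["u1"], checkins["u2"])
```

In a matrix factorization model, a signal like this would typically enter as a regularizer or an extra term that pulls the latent factors of friends closer together; how the paper does so is not stated in the abstract.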
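In this paper, we propose a novel hand-based person recognition method for criminal investigations, since an image of the hand is often the only available information in cases of serious crime such as sexual abuse. Our proposed method, Multi-Branch with Attention Network (MBA-Net), incorporates channel and spatial attention modules in its branches, in addition to a global (attention-free) branch that captures global structural information for discriminative feature learning. The attention modules focus on the relevant features of the hand image while suppressing irrelevant background. To overcome a weakness of attention mechanisms, namely their equivariance to pixel shuffling, we integrate relative positional encodings into the spatial attention module to capture the spatial positions of pixels. Extensive evaluations on two large, multi-ethnic, publicly available hand datasets show that our proposed method achieves state-of-the-art performance, surpassing existing hand-based identification methods.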
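In cases of serious crime, including sexual abuse, images of the hands are often the only available information with demonstrated potential for identification. Because this evidence is captured in uncontrolled conditions, it is difficult to analyze. As global approaches to feature comparison are limited in this setting, it is important to also consider local information. In this work, we propose hand-based person identification by learning both global and local deep feature representations. Our proposed method, Global and Part-Aware Network (GPA-Net), creates global and local branches on the conv layer to learn robust, discriminative global and part-level features. To learn the local (part-level) features, we perform uniform partitioning on the conv layer in both the horizontal and vertical directions. We retrieve the parts by soft partitioning, without explicitly partitioning the images or requiring external cues such as pose estimation. Extensive evaluations on two large, multi-ethnic, publicly available hand datasets show that our proposed method significantly outperforms competing methods.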
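The uniform partitioning described in this abstract can be pictured as slicing a feature map into equal stripes and pooling each one. The sketch below does this on a single-channel 2D map held as nested lists. The function name `stripe_means`, the use of average pooling, and the toy values are assumptions; an actual network would operate on multi-channel tensors with learned branches.

```python
def stripe_means(feature_map, num_parts, axis):
    """Average-pool a 2D map over `num_parts` equal stripes.

    axis=0 splits the rows into horizontal stripes;
    axis=1 splits the columns into vertical stripes.
    """
    if axis == 1:
        # Transpose so that columns can be sliced like rows.
        feature_map = [list(col) for col in zip(*feature_map)]
    size = len(feature_map)
    if size % num_parts != 0:
        raise ValueError("map size must be divisible by num_parts")
    step = size // num_parts
    means = []
    for start in range(0, size, step):
        values = [v for row in feature_map[start:start + step] for v in row]
        means.append(sum(values) / len(values))
    return means


# Toy 4x4 "feature map" (illustrative values only).
fmap = [
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [5.0, 6.0, 7.0, 8.0],
]
horizontal = stripe_means(fmap, 2, axis=0)  # top half, bottom half
vertical = stripe_means(fmap, 2, axis=1)    # left half, right half
```

Because the stripes are taken from the feature map rather than from the input image, no explicit image cropping or pose estimate is needed, which matches the motivation given in the abstract.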
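Human action recognition (HAR) aims to understand human behavior and assign a label to each action. It has a wide range of applications and has therefore attracted increasing attention in the field of computer vision. Human actions can be represented using various data modalities, such as RGB, skeleton, depth, infrared, point cloud, event stream, audio, acceleration, radar, and WiFi signals, which encode different sources of useful yet distinct information and offer different advantages depending on the application scenario. Consequently, many existing works have investigated different types of approaches to HAR using various modalities. In this paper, we present a comprehensive survey of recent progress in deep learning methods for HAR, organized by the type of input data modality. Specifically, we review the current mainstream deep learning methods for single and multiple data modalities, including fusion-based and co-learning-based frameworks. We also present comparative results on several benchmark datasets for HAR, together with insightful observations and inspiring future research directions.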
Vehicle-to-Everything (V2X) communication has been proposed as a potential solution to improve the robustness and safety of autonomous vehicles by improving coordination and removing the barrier of non-line-of-sight sensing. Cooperative Vehicle Safety (CVS) applications depend heavily on the reliability of the underlying data system, which can suffer from loss of information due to inherent issues in its components, such as sensor failures or the poor performance of V2X technologies under dense communication channel load. In particular, information loss affects the target classification module and, subsequently, the performance of the safety application. To enable reliable and robust CVS systems that mitigate the effect of information loss, we propose a Context-Aware Target Classification (CA-TC) module coupled with a hybrid learning-based predictive modeling technique for CVS systems. The CA-TC consists of two modules: a Context-Aware Map (CAM) and a Hybrid Gaussian Process (HGP) prediction system. The vehicle safety applications use the information from the CA-TC, which makes them more robust and reliable. The CAM leverages vehicles' path history, road geometry, tracking, and prediction, and the HGP provides accurate vehicle trajectory predictions to compensate for data loss (due to communication congestion) or inaccurate sensor measurements. Based on offline real-world data, we learn a finite bank of driver models that represent the joint dynamics of the vehicle and the driver's behavior. We combine offline training and online model updates with on-the-fly forecasting to account for new possible driver behaviors. Finally, our framework is validated using simulation and realistic driving scenarios to confirm its potential for enhancing the robustness and reliability of CVS systems.
According to WHO statistics, many people suffer from visual impairments, and their number is increasing every year. One of their most critical needs is the ability to navigate safely, which is why researchers are working to create and improve navigation systems. This paper presents a navigation concept based on visual SLAM and YOLO using monocular cameras. Using the ORB-SLAM algorithm, our concept builds a map of a predefined route that a blind person uses most often. Obstacle detection has been added to the system, both because visually impaired people are curious about their environment and because it is needed to guide them properly. Because safe navigation is vital for visually impaired people, our concept also includes a path-following part. This part consists of three steps, all performed with monocular cameras: obstacle distance estimation, path deviation detection, and next-step prediction.
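The path deviation step named in this abstract can be pictured as measuring how far the current position lies from the mapped route. The abstract does not say how deviation is computed, so the sketch below is only one plausible reading: it assumes 2D map coordinates (for example, from a SLAM pose projected onto the ground plane), a route stored as a polyline, and a hypothetical `tolerance` in the same units.

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment from a to b (2D)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        # Degenerate segment: a and b coincide.
        return math.hypot(px - ax, py - ay)
    # Projection parameter of p onto the segment, clamped to [0, 1].
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def path_deviation(position, route):
    """Smallest distance from `position` to any segment of the polyline `route`."""
    if len(route) < 2:
        raise ValueError("route needs at least two waypoints")
    return min(point_segment_distance(position, a, b) for a, b in zip(route, route[1:]))


def has_deviated(position, route, tolerance=0.5):
    """Flag a deviation when the user is farther than `tolerance` from the route."""
    return path_deviation(position, route) > tolerance


# Hypothetical L-shaped route in map coordinates.
route = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0)]
on_track = has_deviated((2.0, 0.3), route)   # close to the first segment
off_track = has_deviated((5.5, 3.0), route)  # well past the end of the route
```

A real system would also need to handle scale ambiguity in monocular SLAM and noise in the pose estimate, neither of which is modeled here.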